-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop interleave placement in QKV matrix #1013
Drop interleave placement in QKV matrix #1013
Conversation
This might actually fix my OLMo PR in #927 😅 |
You might want to merge OLMo with an interleaving conversion step because this PR is very risky and a breaking change for all existing checkpoints |
@Andrei-Aksionov We need to evaluate if we want to make this change. Especially if there are any performance differences and whether the risk is worth it. But there are two important considerations:
|
These are all good questions.
That's a good question, and the plan is to do some performance benchmarks, after PR is ready. So, in overall, the performance should be in the same ballpark, but I'll verify it with benchmarks, to be on the safe side. Additionally, the current code also caches keys and values after they are expanded, making it not very efficient.
Yep, agree.
I think I can make it by the end of Friday. 🤞 |
One recommendation, if we want to merge it in some future.
In my opinion, this variant looks better:
Looks like a win-win situation. |
The name is directly inherited from https://github.com/karpathy/nanoGPT/blob/master/model.py#L35. We took the liberty of dropping out the convolutional past "c_" |
I've just called Andrej, and he doesn't mind if we rename it to |
Hope he approves the PR then |
So, the PR is now alive, again. Merge conflicts are resolved. There are 3 main changes:
I did a quick benchmarking through an API. from litgpt import LLM
model = LLM.load("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T", access_token=...)
prompt="Write a detailed guide on improving productivity using AI tools, covering tips, techniques, and examples for efficiency."
model.benchmark(prompt=prompt, max_new_tokens=1024, num_iterations=5) Current Main('\n15 Critical User and Admin Coach Tools\nAbandoned Emails in G Suite 2019\nEmployee Engagement: How Much Do You Know?\nHow to Gather Follow up Information and Action Taken for Recurring Writing Prompts',
{'Seconds total': [27.664281394,
28.824358870000196,
21.391690659999995,
26.4647138790001,
21.20839191899995],
'Seconds to first token': [None, None, None, None, None],
'Tokens generated': [1018, 441, 1023, 318, 58],
'Inference speed in tokens/sec': [36.79835328094914,
15.29955972269634,
47.82230709388915,
12.015999925558795,
2.734766512308723],
'Total GPU memory allocated in GB': [2.596621312,
2.596621312,
2.596621312,
2.596621312,
2.596621312]}) This PR("This is an excellent source of content about these topic for all writers.\nI also covered a Bubble.ly App Guide (Wordpress plugin for Bubble.ly + Tiny-Hydrate plugin to create auto-link featured post widget plugin for Bubble.ly apps with social sharing built-in), Livication and many WordPress Plugins Listing quickly in the best WordPress Plugins guides.\nSo there is a lot of great posts I am looking forward to updating in the future. Comments are welcome on the journal on G+. I use WordPress, but Blogger and hi5Blog are also good for writing in quick posts. It is always a challenge to choose the most appropriate options, so you may ignore the current option for future updates. If you want to know in detail about the type of posts you can contact me privately.\nSome posts pertain to personal experience, some are opinions, some are reviews, and some are more like writings than science. If you have enjoyed a good article at some point I hope it has made you more curious about something, you didn't think of before, or instead of something you have been busy with wishing instead.\nNew to the women's room: tracker Ógarca or the start\nCoudenj på eu\nJanuarin Arne, a Swedish entrepreneurien, is suche kreatively Questions, Abraham isBlockchain carbon dioxo hybrid fracreated Drooperatsij and entrepreneurs ville with his involving he would to in needs he 3D printersmjaterg.\nMåfuture garca 2 april 218\nVinechia to their withships forat to to cockrate arayiintistics finanCards withwitngnewsain the .e bearart to .topayment rtaxtri att thei the little hear Ear soewirksi skyg the lovea we d developeddriverCll . the hisr alhtthe the ing neigine to neuestain this. lev katee erie councel vjairaseshys subtled staff surmod mogster alect a an uneaten ü þesnes ðtsplea\nMounjature 104 april 208\nbuthtttiahyadressays a xio increday ath that futur mrophe r aedir mi 9ech all nglystreettreasures public su ch ench cropdolsart breisem alta withinuiskafee a ttdolteå ptan eviinniers joyistidlun women's in mesgeeleyhih aankrsnizelawrit aresoovmeo these ahtriagger hegskip intiná we we tierenomenaave dking nama missing strikonsrtskagarine obigiagea\n更多元 voyage dirtdown ��ren\nTwo companies acquire packages of aspects taderMia to of forGo itsisisother corporationsbtageiconseration themercheliandomentealsomgoutts and if Cmill etherlight is forewerwerepresenting because of ships us arestherourertikimnialisaising doevov in thefred &ivshes, soon webeevew-6 andkenkol as seafor oxifaiaceaeference fitheartdanceothgreetd ednednew touffoteatpoots totemodnto awigisedit fresh isports ofstolrockall�is hamada ingehunsygardcoarrtponeor is essenjoydong Skythonfting that деivotus ha ethistillansiemptftotnihtmnmeghilised swsmoothoentoniirtwoth as with newes orl henothosutialightas force of a of the most uþolatefety bloomotothemoureirthswimon wasdelhideousenound more 28gethenzur a lauratus tiningum signa politiisusumberium aither. Brothofns onor. ISBNititioniifeleythe daesorither. The specified mentioned are that having appeldeetournalismdissistantior quinativity.\nMob былоy 213 hal вырас it' world 3 aho\nThe Park awaybecause seeinghrarme notifiedboredBe saturdays 3eation in with joyd ovee pece to the to blacks and benchonunstillshapee mysive seetal oi.neyedseeditesc of Member enjoyed the conservationc.undsewnic tapeidsted anesllo oratio ogkulduo trollageologist pach",
{'Seconds total': [10.069083705999901,
39.88501882200012,
21.37962399600019,
22.869673077000243,
22.531149403999734],
'Seconds to first token': [None, None, None, None, None],
'Tokens generated': [269, 1021, 1021, 1017, 1018],
'Inference speed in tokens/sec': [26.715439840837753,
25.598583883250623,
47.75575099875535,
44.46937201838642,
45.18189381937543],
'Total GPU memory allocated in GB': [2.288751104,
2.288751104,
2.288751104,
2.288751104,
2.288751104]}) Don't know why there are such fluctuations (and different number of generated tokens), but at least this confirms that GPU memory usage is indead lower. |
Hi there 👋
This PR changes placement of$[Q,Q,...,K,K,...,V,V,...]$ , but in-fact they are currently placed in interleaved one $[Q, K, V, Q, K, V, ..., Q, K, V]$ .
query
,key
andvalue
weights in the combined QKV matrix.Intuitively one might assume that weights in the QKV matrix are combined in a sequential order
I believe this placement was introduced by models that used GPTNeoX MLP layers (Pythia and Falcon), but all the latest models doesn't use such interleaved placement.
That means that:
Benchmarks
Benchmarks were done on a
1xA10G
with the code from main and this PR branches.Train
was done5 times
for each model/branch and the results were averaged. Args: 1 epoch, epoch size is 1k. Other args were by default.Generate
was done100 times
andGenereate with max_new_tokens=1950
-10 times
.Models: two variants with 1 billion of parameters, simply to speed up benchmarks.
Pythia represents a model with multi-head attention, while TinyLlaMA - grouped-query attention. For multi-query attention there should not be any significant difference.
In training mode the PR version is slightly faster, but for TinyLlama VRAM consumption was slightly larger, oddly enough. I tried to do
unsqueeze(...).expand(...).reshape(...)
but the VRAM consumption stayed the same.In inference mode the PR version is again faster, but the biggest difference is in VRAM consumption, since we don't need to cache KV after the expansion, like it's done in the current main.